Abstract

This work concerns the estimation of multidimensional nonlinear re- 
gression models using multilayer perceptrons (MLPs) . The main problem 
with such models is that we need to know the covariance matrix of the 
noise to get an optimal estimator. However, we show in this paper that 
if we choose as the cost function the logarithm of the determinant of the 
empirical error covariance matrix, then we get an asymptotically optimal 
estimator. Moreover, under suitable assumptions, we show that this cost 
function leads to a very simple asymptotic law for testing the number of 
parameters of an identifiable MLP. Numerical experiments confirm the 
theoretical results. 

Keywords: non-linear regression, multivariate regression, multilayer
perceptrons, asymptotic normality

1 Introduction 

Let us consider a sequence $(Y_t, Z_t)_{t \in \mathbb{N}}$ of i.i.d. (i.e. independent,
identically distributed) random vectors, with $Y_t$ a $d$-dimensional vector. Each
couple $(Y_t, Z_t)$ has the same law as a generic variable $(Y, Z)$, but it is not hard
to generalize everything shown in this paper to stationary mixing variables and
therefore to time series. We assume that the model can be written as

$$Y_t = F_{W^0}(Z_t) + \varepsilon_t$$

where 

• $F_{W^0}$ is a function represented by an MLP with parameters or weights $W^0$.

• $(\varepsilon_t)$ is an i.i.d. centered noise with unknown invertible covariance matrix
$\Gamma_0$.



This corresponds to a multivariate non-linear least squares model, as in
chapters 3.1 and 5.1 of Gallant [5]. Indeed, an MLP function can be seen as a
parametric non-linear function. For example, a one-hidden-layer MLP using the
hyperbolic tangent (tanh) as transfer function can be written
$F_{W^0}(Z_t) = \left(F^1_{W^0}(Z_t), \dots, F^d_{W^0}(Z_t)\right)^T$, where $T$ denotes
transposition, with:



where $H$ is the number of hidden units and $L$ is the dimension of the input $z$;
then the parameter vector is



There are some obvious transformations that can be applied to an MLP
without changing its input-output map. For instance, suppose we pick a hidden
node $j$ and change the sign of all the weights $w_{ij}$ for $i = 0, \dots, H$, and
also the sign of all $a_{ij}$ for $i = 0, \dots, d$. Since tanh is odd, this will not alter
the contribution of this node to the total net output. Another possibility is
to interchange two hidden nodes, that is, to take two hidden nodes $j_1$ and $j_2$
and relabel $j_1$ as $j_2$ and $j_2$ as $j_1$, taking care to also relabel the corresponding
weights. These transformations form a finite group (see Sussmann [10]).
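
The two symmetries can be checked numerically. The sketch below assumes one common layout for a one-hidden-layer tanh MLP (output bias `a0`, output weights `a`, hidden bias `w0`, hidden weights `w`); these names and shapes are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def mlp(a0, a, w0, w, z):
    """One-hidden-layer tanh MLP: a0 + a @ tanh(w0 + w @ z).

    Assumed shapes: a0 (d,), a (d, H), w0 (H,), w (H, L), z (L,).
    """
    return a0 + a @ np.tanh(w0 + w @ z)

rng = np.random.default_rng(0)
d, H, L = 2, 3, 2
a0, a = rng.normal(size=d), rng.normal(size=(d, H))
w0, w = rng.normal(size=H), rng.normal(size=(H, L))
z = rng.normal(size=L)
reference = mlp(a0, a, w0, w, z)

# Sign flip of hidden unit j: negate its bias, its incoming weights
# and its outgoing weights. tanh is odd, so the output is unchanged.
j = 1
a_flip, w0_flip, w_flip = a.copy(), w0.copy(), w.copy()
a_flip[:, j] *= -1
w0_flip[j] *= -1
w_flip[j, :] *= -1
flipped = mlp(a0, a_flip, w0_flip, w_flip, z)

# Permutation of hidden units: reorder the rows of w and w0 and the
# columns of a with the same permutation.
perm = np.array([2, 0, 1])
permuted = mlp(a0, a[:, perm], w0[perm], w[perm, :], z)

print(np.allclose(reference, flipped), np.allclose(reference, permuted))
```

Both transformed networks produce the same output as the reference one, which is why the parameters are only identifiable up to this finite group.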

We will consider equivalence classes of one-hidden-layer MLPs: two MLPs are
in the same class if the first one is the image of the second one under such a
transformation. The considered set of parameters is then the quotient of the
parameter space by this finite group. In this space, we assume that the model
is identifiable: this means that the true model belongs to the considered family
of models and that we consider MLPs without redundant units. This is a very
strong assumption, but it is known that the estimated weights of an MLP with
redundant units can have a very strange asymptotic behavior (see Fukumizu [3]),
because the Hessian matrix is singular. The consequence of the identifiability
of the model is that the Hessian matrix computed in the sequel will be positive
definite (see Fukumizu [3]). In the sequel we will always assume that we are
under the assumptions making the Hessian matrix positive definite.

1.1 Efficient estimation 

A popular choice for the associated cost function is the mean square error:

$$\frac{1}{n} \sum_{t=1}^{n} \| Y_t - F_W(Z_t) \|^2 \qquad (1)$$

where $\|\cdot\|$ denotes the Euclidean norm on $\mathbb{R}^d$. Although this function is widely
used, it is easy to show that we then get a suboptimal estimator, with a larger
asymptotic variance than the estimator minimizing the generalized mean square
error:

$$\frac{1}{n} \sum_{t=1}^{n} (Y_t - F_W(Z_t))^T \, \Gamma_0^{-1} \, (Y_t - F_W(Z_t)) \qquad (2)$$

However, we need to know the true covariance matrix of the noise to use this
cost function. A possible solution is to use an approximation $\Gamma$ of the error
covariance matrix $\Gamma_0$ to compute the generalized least squares estimator:

$$\frac{1}{n} \sum_{t=1}^{n} (Y_t - F_W(Z_t))^T \, \Gamma^{-1} \, (Y_t - F_W(Z_t)) \qquad (3)$$

A way to construct a sequence $(\Gamma_k)_{k \in \mathbb{N}^*}$ yielding a good approximation of
$\Gamma_0$ is the following: using the ordinary least squares estimator $W_n^1$, the noise
covariance can be approximated by

$$\Gamma_1 := \Gamma(W_n^1) := \frac{1}{n} \sum_{t=1}^{n} (Y_t - F_{W_n^1}(Z_t))(Y_t - F_{W_n^1}(Z_t))^T. \qquad (4)$$

Then we can use this new covariance matrix to find a generalized least squares
estimator $W_n^2$:

$$W_n^2 = \arg\min_W \frac{1}{n} \sum_{t=1}^{n} (Y_t - F_W(Z_t))^T \, (\Gamma_1)^{-1} \, (Y_t - F_W(Z_t)) \qquad (5)$$

and calculate again a new covariance matrix

$$\Gamma_2 := \Gamma(W_n^2) = \frac{1}{n} \sum_{t=1}^{n} (Y_t - F_{W_n^2}(Z_t))(Y_t - F_{W_n^2}(Z_t))^T.$$

It can be shown that this procedure gives a sequence of parameters

$$W_n^1 \to \Gamma_1 \to W_n^2 \to \Gamma_2 \to \cdots$$

minimizing the logarithm of the determinant of the empirical covariance matrix
(see chapter 5 in Gallant [5]):

$$U_n(W) := \log\det\left( \frac{1}{n} \sum_{t=1}^{n} (Y_t - F_W(Z_t))(Y_t - F_W(Z_t))^T \right) \qquad (6)$$

The use of this cost function for neural networks was introduced by Williams
in 1996 [12]; however, its theoretical and practical properties have not yet been
studied. Here, the calculation of the asymptotic properties of $U_n(W)$ will show
that this cost function leads to an asymptotically optimal estimator, with the
same asymptotic variance as the estimator minimizing (2); we then say that
the estimator is "efficient".
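
As a minimal sketch of the quantities in equations (3), (4) and (6), the functions below take an array of residuals $Y_t - F_W(Z_t)$ (one row per observation) rather than a specific network; the function names are hypothetical and chosen only for this illustration.

```python
import numpy as np

def empirical_cov(residuals):
    """(1/n) * sum_t e_t e_t^T for an (n, d) array of residuals, as in eq. (4)."""
    n = residuals.shape[0]
    return residuals.T @ residuals / n

def logdet_cost(residuals):
    """U_n(W) of eq. (6): log-determinant of the empirical residual covariance."""
    sign, value = np.linalg.slogdet(empirical_cov(residuals))
    if sign <= 0:
        raise ValueError("empirical covariance is not positive definite")
    return value

def gls_cost(residuals, gamma):
    """Generalized least squares cost of eq. (3) for a fixed matrix gamma."""
    solved = np.linalg.solve(gamma, residuals.T)   # columns are gamma^{-1} e_t
    return float(np.mean(np.sum(residuals.T * solved, axis=0)))

rng = np.random.default_rng(0)
residuals = rng.normal(size=(200, 2))
print(logdet_cost(residuals))
# Plugging the empirical covariance of the same residuals into eq. (3)
# always gives d, since the cost then equals trace(gamma^{-1} gamma).
print(gls_cost(residuals, empirical_cov(residuals)))
```

The second printed value illustrates why the generalized cost cannot simply be minimized jointly over the weights and the covariance: with the covariance re-estimated from the same residuals it is constant, and the information about the fit is carried by the determinant instead.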
1.2 Testing the number of parameters



Let $q$ be an integer less than $s$. We want to test "$H_0 : W \in \Theta_q \subset \mathbb{R}^q$" against
"$H_1 : W \in \Theta_s \subset \mathbb{R}^s$", where the sets $\Theta_q$ and $\Theta_s$ are compact and $\Theta_q \subset \Theta_s$.
$H_0$ expresses the fact that $W$ belongs to a subset $\Theta_q$ of $\Theta_s$ with a parametric
dimension smaller than $s$ or, equivalently, that $s - q$ weights of the MLP in $\Theta_s$
are null. If we consider the classical mean square error cost function
$V_n(W) = \sum_{t=1}^{n} \| Y_t - F_W(Z_t) \|^2$, we get the following test statistic:

$$S_n = n \times \left( \min_{W \in \Theta_q} V_n(W) - \min_{W \in \Theta_s} V_n(W) \right)$$

Under the null hypothesis $H_0$, it is shown in Yao [13] that $S_n$ converges in law
to a weighted sum of $\chi^2_1$ variables:

$$S_n \xrightarrow{\mathcal{D}} \sum_{i=1}^{s-q} \lambda_i \chi^2_{i,1}$$

where the $\chi^2_{i,1}$ are $s - q$ i.i.d. $\chi^2_1$ variables and the $\lambda_i$ are strictly positive
eigenvalues of the asymptotic covariance matrix of the estimated weights, different
from 1 if the true covariance matrix of the noise is not the identity matrix. So, in
the general case, where the true covariance matrix of the noise is not the identity
matrix, the asymptotic distribution is not known, because the $\lambda_i$ are not known,
and it is difficult to compute the asymptotic level of the test.

However, if we use the cost function $U_n(W)$ then, under $H_0$, the test
statistic

$$T_n = n \times \left( \min_{W \in \Theta_q} U_n(W) - \min_{W \in \Theta_s} U_n(W) \right) \qquad (7)$$

will converge to a classical $\chi^2_{s-q}$, so the asymptotic level of the test will be very
easy to compute. This is another advantage of using the cost function in Eq.
(6). Note that this result is true even if the noise is not Gaussian (it is more
general than the maximum likelihood estimator) and without knowing the true
covariance of the noise $\Gamma_0$, so without using the cost function (2) or even an
approximation of it.
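
Once the two minima are available, the asymptotic p-value of the test follows from the $\chi^2_{s-q}$ survival function. The sketch below uses only the Python standard library and a closed-form expression valid for integer degrees of freedom; `chi2_sf` and `parameter_test` are hypothetical helper names, and the numbers in the usage line are made up for illustration.

```python
import math

def chi2_sf(x, df):
    """P(chi2_df > x) for a positive integer df, in closed form."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Base cases of the regularized upper incomplete gamma function Q(a, half):
    # Q(1/2, half) = erfc(sqrt(half)) for odd df, Q(1, half) = exp(-half) for even df.
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(half))
    else:
        a, q = 1.0, math.exp(-half)
    # Recurrence Q(a + 1, half) = Q(a, half) + half**a * exp(-half) / Gamma(a + 1).
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q

def parameter_test(n, min_u_restricted, min_u_full, s, q):
    """Statistic T_n of eq. (7) and its asymptotic p-value under H_0."""
    t_n = n * (min_u_restricted - min_u_full)
    return t_n, chi2_sf(t_n, s - q)

# Illustrative values only: n = 1000 observations, s - q = 2 weights removed.
t_n, p_value = parameter_test(1000, 0.010, 0.004, s=5, q=3)
print(t_n, p_value)
```

A small p-value suggests rejecting $H_0$, that is, keeping the $s - q$ extra weights.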

In order to prove these properties, the paper is organized as follows. First we
compute the first and second derivatives of $U_n(W)$ with respect to the weights
of the MLP, then we deduce the announced properties with classical statistical
arguments. Finally, we confirm the theoretical results with numerical
experiments.



2 The first and second derivatives of $W \mapsto U_n(W)$

First, we introduce a notation: if $F_W(X)$ is a $d$-dimensional parametric
function depending on a parameter vector $W$, let us write $\frac{\partial F_W(X)}{\partial W_k}$
(resp. $\frac{\partial^2 F_W(X)}{\partial W_k \partial W_l}$) for the $d$-dimensional vector of partial derivatives
(resp. second order partial derivatives) of each component of $F_W(X)$. Moreover,
if $\Gamma(W)$ is a matrix depending on $W$, let us write $\frac{\partial}{\partial W_k} \Gamma(W)$ for the matrix of
partial derivatives of each component of $\Gamma(W)$.

2.1 First derivatives 

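Now, if $\Gamma_n(W)$ is a matrix depending on the parameter vector $W$, we get (see
Magnus and Neudecker [8])

$$\frac{\partial}{\partial W_k} \log\det(\Gamma_n(W)) = \mathrm{tr}\left( \Gamma_n^{-1}(W) \frac{\partial}{\partial W_k} \Gamma_n(W) \right).$$

Here

$$\Gamma_n(W) = \frac{1}{n} \sum_{t=1}^{n} (y_t - F_W(z_t))(y_t - F_W(z_t))^T.$$

Note that this matrix $\Gamma_n(W)$ and its inverse are symmetric. Now, if we write

$$A_n(W_k) = \frac{1}{n} \sum_{t=1}^{n} \left( - \frac{\partial F_W(z_t)}{\partial W_k} (y_t - F_W(z_t))^T \right),$$

so that $\frac{\partial}{\partial W_k} \Gamma_n(W) = A_n(W_k) + A_n^T(W_k)$, then, using the fact that

$$\mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_k) \right) = \mathrm{tr}\left( A_n^T(W_k) \Gamma_n^{-1}(W) \right) = \mathrm{tr}\left( \Gamma_n^{-1}(W) A_n^T(W_k) \right),$$

we get

$$\frac{\partial}{\partial W_k} \log\det(\Gamma_n(W)) = 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_k) \right). \qquad (8)$$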
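
The log-determinant derivative rule used here can be sanity-checked by finite differences. The sketch below uses an arbitrary one-parameter family of symmetric positive definite matrices (not an MLP); `gamma`, `gamma0` and `direction` are illustrative names.

```python
import numpy as np

def logdet(matrix):
    sign, value = np.linalg.slogdet(matrix)
    assert sign > 0, "matrix must have positive determinant"
    return value

rng = np.random.default_rng(0)
d = 3
base = rng.normal(size=(d, d))
gamma0 = base @ base.T + d * np.eye(d)      # symmetric positive definite
direction = rng.normal(size=(d, d))
direction = direction + direction.T          # symmetric, plays the role of dGamma/dtheta

def gamma(theta):
    """One-parameter family Gamma(theta) = gamma0 + theta * direction."""
    return gamma0 + theta * direction

theta, h = 0.1, 1e-6
# Analytic derivative: tr(Gamma^{-1} dGamma/dtheta).
analytic = np.trace(np.linalg.solve(gamma(theta), direction))
# Central finite difference of log det Gamma(theta).
numeric = (logdet(gamma(theta + h)) - logdet(gamma(theta - h))) / (2 * h)
print(analytic, numeric)
```

The two printed values agree up to finite-difference error.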



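2.2 Computing the derivative of $W \mapsto U_n(W)$ for an MLP

Let us denote by $(\Gamma_n(W))_{ij}$ (resp. $(\Gamma_n^{-1}(W))_{ij}$) the element in the $i$th row and $j$th
column of the matrix $\Gamma_n(W)$ (resp. $\Gamma_n^{-1}(W)$). We also denote by $F_W(z_t)(i)$ the $i$th
component of a multidimensional function, and for a matrix $A = (A_{ij})$ we denote
by $(A_{ij})_{1 \le i,j \le d}$ the vector obtained by concatenation of the columns of $A$.
Following the previous results, we can write for the derivative of $\log(\det(\Gamma_n(W)))$
with respect to the weight $W_k$:

$$\frac{\partial}{\partial W_k} \log(\det(\Gamma_n(W))) = \left( (\Gamma_n^{-1}(W))_{ij} \right)^T_{1 \le i,j \le d} \left( \frac{\partial (\Gamma_n(W))_{ij}}{\partial W_k} \right)_{1 \le i,j \le d}$$

with

$$\frac{\partial (\Gamma_n(W))_{ij}}{\partial W_k} = - \frac{1}{n} \sum_{t=1}^{n} \left( \frac{\partial F_W(z_t)(i)}{\partial W_k} \times (y_t - F_W(z_t))(j) + \frac{\partial F_W(z_t)(j)}{\partial W_k} \times (y_t - F_W(z_t))(i) \right) \qquad (9)$$

so

$$\frac{\partial}{\partial W_k} \log(\det(\Gamma_n(W))) = - \frac{1}{n} \left( (\Gamma_n^{-1}(W))_{ij} \right)^T_{1 \le i,j \le d} \left( \sum_{t=1}^{n} \frac{\partial F_W(z_t)(i)}{\partial W_k} \times (y_t - F_W(z_t))(j) + \frac{\partial F_W(z_t)(j)}{\partial W_k} \times (y_t - F_W(z_t))(i) \right)_{1 \le i,j \le d} \qquad (10)$$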

The quantity $\frac{\partial F_W(z_t)(i)}{\partial W_k}$ is computed by back-propagating the constant 1
for the MLP restricted to the output $i$. Figure 1 gives an example of an MLP
restricted to the output 2.



Figure 1: MLP restricted to the output 2 : the plain lines 




Hence, computing the gradient of $U_n(W)$ with respect to the parameters
of the MLP is straightforward. We have to compute the derivative with respect
to the weights of each single-output MLP extracted from the original MLP by
back-propagating the constant value 1; then, according to formula (9), we can
easily compute the derivative of each term of the empirical covariance matrix
of the noise. Finally, the gradient is obtained by summing all the derivative
terms of the empirical covariance matrix multiplied by the terms of its inverse,
as in formula (10).
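
The recipe above can be sketched for any model whose per-output derivatives are available as an array. To keep the example short, the check below uses a linear model $F_W(z) = Wz$ instead of an MLP (an assumption made purely for brevity), so the Jacobian is known in closed form and no back-propagation code is needed; `logdet_gradient` is a hypothetical name.

```python
import numpy as np

def logdet_cost(residuals):
    n = residuals.shape[0]
    return np.linalg.slogdet(residuals.T @ residuals / n)[1]

def logdet_gradient(residuals, jacobians):
    """Gradient of U_n through formula (8): dU_n/dW_k = 2 tr(Gamma_n^{-1} A_n(W_k)).

    residuals: (n, d) array of y_t - F_W(z_t)
    jacobians: (n, d, p) array with jacobians[t, i, k] = dF_W(z_t)(i) / dW_k
    """
    n = residuals.shape[0]
    gamma_inv = np.linalg.inv(residuals.T @ residuals / n)
    # A_n(W_k)[i, j] = -(1/n) * sum_t dF(i)/dW_k * residual_t(j)
    a = -np.einsum("tik,tj->kij", jacobians, residuals) / n
    # tr(gamma_inv @ a[k]) for every k
    return 2.0 * np.einsum("ji,kij->k", gamma_inv, a)

rng = np.random.default_rng(1)
n, d, L = 200, 2, 3
z = rng.normal(size=(n, L))
y = z @ rng.normal(size=(L, d)) + rng.normal(size=(n, d))
w = rng.normal(size=d * L)                   # flattened (d, L) weight matrix

def residuals_of(w_flat):
    return y - z @ w_flat.reshape(d, L).T

# For F_W(z) = W z: dF(i)/dW[a, b] = 1{i == a} * z_b, with k = a * L + b.
jac = np.einsum("ia,tb->tiab", np.eye(d), z).reshape(n, d, d * L)
grad = logdet_gradient(residuals_of(w), jac)

h = 1e-6
numeric = np.array([
    (logdet_cost(residuals_of(w + h * e)) - logdet_cost(residuals_of(w - h * e))) / (2 * h)
    for e in np.eye(d * L)
])
print(np.max(np.abs(grad - numeric)))
```

For an MLP, only the construction of `jacobians` changes: each slice `jacobians[:, i, :]` would come from back-propagating the constant 1 through the network restricted to output `i`.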
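2.3 Second derivatives

We write now

$$B_n(W_k, W_l) := \frac{1}{n} \sum_{t=1}^{n} \left( \frac{\partial F_W(z_t)}{\partial W_k} \frac{\partial F_W(z_t)^T}{\partial W_l} \right)$$

and

$$C_n(W_k, W_l) := \frac{1}{n} \sum_{t=1}^{n} \left( -(y_t - F_W(z_t)) \frac{\partial^2 F_W(z_t)^T}{\partial W_k \partial W_l} \right)$$

We get

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = \frac{\partial}{\partial W_l} 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_k) \right) = 2\, \mathrm{tr}\left( \frac{\partial \Gamma_n^{-1}(W)}{\partial W_l} A_n(W_k) \right) + 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W) B_n(W_k, W_l) \right) + 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W) C_n(W_k, W_l) \right)$$

Now, Magnus and Neudecker [8] give an analytic form of the derivative of an
inverse matrix, from which we get

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W) \left( A_n(W_k) + A_n^T(W_k) \right) \Gamma_n^{-1}(W) A_n(W_k) \right) + 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W) B_n(W_k, W_l) \right) + 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W) C_n(W_k, W_l) \right)$$

and

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = 4\, \mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_k) \Gamma_n^{-1}(W) A_n(W_k) \right) + 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W) B_n(W_k, W_l) \right) + 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W) C_n(W_k, W_l) \right) \qquad (11)$$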
3 Asymptotic properties 

In the sequel, we will assume that the square of the noise $\varepsilon$ is integrable and
that the cube of the variable $Z$ is integrable too. Moreover, it is easy to show
that, for an MLP function, there exists a constant $C$ such that we have the
following inequalities:

$$\left\| \frac{\partial F_W(Z)}{\partial W_k} \right\| \le C (1 + \|Z\|)$$

$$\left\| \frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l} \right\| \le C (1 + \|Z\|^2)$$

These inequalities will be important to get the local asymptotic normality
property implying the asymptotic normality of the parameter minimizing $U_n(W)$.

3.1 Consistency and asymptotic normality of $W_n$

First we have to identify the contrast function associated with $U_n(W)$.

Lemma 1

$$U_n(W) - U_n(W^0) \xrightarrow{a.s.} K(W, W^0)$$

with $K(W, W^0) \ge 0$ and $K(W, W^0) = 0$ if and only if $W = W^0$.

Proof: Let us write

$$\Gamma(W) = E\left( (Y - F_W(Z))(Y - F_W(Z))^T \right)$$

for the expectation of the covariance matrix of the noise for model parameter $W$.
By the strong law of large numbers we have

$$U_n(W) - U_n(W^0) \xrightarrow{a.s.} \log\det(\Gamma(W)) - \log\det(\Gamma(W^0)) = \log \frac{\det(\Gamma(W))}{\det(\Gamma(W^0))} = \log\det\left( \Gamma^{-1}(W^0) \left( \Gamma(W) - \Gamma(W^0) \right) + I_d \right)$$

where $I_d$ denotes the identity matrix of $\mathbb{R}^d$. So, the lemma is true if
$\Gamma(W) - \Gamma(W^0)$ is a positive matrix, null only if $W = W^0$. But this property is true
since

$$\Gamma(W) = E\left( (Y - F_W(Z))(Y - F_W(Z))^T \right)$$
$$= E\left( (Y - F_{W^0}(Z) + F_{W^0}(Z) - F_W(Z))(Y - F_{W^0}(Z) + F_{W^0}(Z) - F_W(Z))^T \right)$$
$$= E\left( (Y - F_{W^0}(Z))(Y - F_{W^0}(Z))^T \right) + E\left( (F_{W^0}(Z) - F_W(Z))(F_{W^0}(Z) - F_W(Z))^T \right)$$
$$= \Gamma(W^0) + E\left( (F_{W^0}(Z) - F_W(Z))(F_{W^0}(Z) - F_W(Z))^T \right)$$

and the lemma follows from the identifiability assumption ■
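
The sign of the limit can be made explicit with a short eigenvalue argument. The sketch below assumes, in addition to what is stated above, that the noise is uncorrelated with $Z$, so that the cross terms in the expansion of $\Gamma(W)$ vanish; the symbols $\Delta(W)$ and $\mu_i(W)$ are introduced here only for illustration.

```latex
\begin{aligned}
\Delta(W) &:= E\left( (F_{W^0}(Z) - F_W(Z))(F_{W^0}(Z) - F_W(Z))^T \right),
\qquad \Gamma(W) = \Gamma(W^0) + \Delta(W), \\
K(W, W^0) &= \log\det\left( I_d + \Gamma^{-1}(W^0)\, \Delta(W) \right) \\
          &= \log\det\left( I_d + \Gamma^{-1/2}(W^0)\, \Delta(W)\, \Gamma^{-1/2}(W^0) \right)
           = \sum_{i=1}^{d} \log\left( 1 + \mu_i(W) \right) \ge 0,
\end{aligned}
```

where the $\mu_i(W) \ge 0$ are the eigenvalues of the symmetric positive semi-definite matrix $\Gamma^{-1/2}(W^0) \Delta(W) \Gamma^{-1/2}(W^0)$, and the second equality uses $\det(I + AB) = \det(I + BA)$. The sum is zero only when every $\mu_i(W)$ is zero, that is when $\Delta(W) = 0$, which means $F_W(Z) = F_{W^0}(Z)$ almost surely.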
We deduce the theorem of consistency:

Theorem 1 We have

$$W_n \xrightarrow{a.s.} W^0$$

Proof. Remark that a constant $B$ exists such that

$$\sup_{W \in \Theta_s} \| Y - F_W(Z) \|^2 < \|Y\|^2 + B \qquad (12)$$

because $\Theta_s$ is compact, so $F_W(Z)$ is bounded. Let us define the function

$$\Phi(\Gamma) := \max\left( \log\det(\Gamma), d \log(\delta) \right)$$

where $d$ is the dimension of the observations $Y$ and $\delta > 0$ is strictly smaller than
the smallest eigenvalue of $\Gamma_0$. Since $\Gamma_0$ is positive definite, we have for all $W$:

$$\lim_{n \to \infty} \Phi(\Gamma_n(W)) \stackrel{a.s.}{=} \lim_{n \to \infty} \log\det(\Gamma_n(W)) = K(W, W^0) + \log\det(\Gamma_0) > d \log(\delta)$$

Now, for all $W$, thanks to inequality (12), there exist constants $\alpha$ and $\beta$
such that

$$\left| \Phi\left( (Y - F_W(Z))(Y - F_W(Z))^T \right) \right| \le \alpha \|Y\|^2 + \beta$$

but the right-hand side of this inequality is integrable, so the function $\Phi$ has an
integrable envelope function and, by example 19.8 of van der Vaart [11], the set
of functions $\{ \Phi\left( (Y - F_W(Z))(Y - F_W(Z))^T \right), W \in \Theta_s \}$ is Glivenko-Cantelli.

Now, theorem 5.7 of van der Vaart [11] shows that $W_n$ converges in
probability to $W^0$, but it is easy to show that this convergence is almost sure.
First, by Lemma 1, we remark that for every neighborhood $\mathcal{N}$ of $W^0$ there exists a
number $\eta(\mathcal{N}) > 0$ such that for all $W \notin \mathcal{N}$ we have

$$\log\det(\Gamma(W)) > \log\det(\Gamma(W^0)) + \eta(\mathcal{N})$$

Now, to show the strong consistency property, we have to prove that for every
neighborhood $\mathcal{N}$ of $W^0$ we have $\lim_{n \to \infty} W_n \subset \mathcal{N}$ almost surely or, equivalently,

$$\lim_{n \to \infty} \log\det(\Gamma(W_n)) - \log\det(\Gamma(W^0)) \stackrel{a.s.}{<} \eta(\mathcal{N})$$

By definition, we have

$$\log\det(\Gamma_n(W_n)) \le \log\det(\Gamma_n(W^0))$$

and the Glivenko-Cantelli property assures that

$$\lim_{n \to \infty} \log\det(\Gamma_n(W^0)) - \log\det(\Gamma(W^0)) \stackrel{a.s.}{=} \lim_{n \to \infty} \Phi(\Gamma_n(W^0)) - \log\det(\Gamma(W^0)) \stackrel{a.s.}{=} 0$$

therefore

$$\lim_{n \to \infty} \log\det(\Gamma_n(W_n)) < \log\det(\Gamma(W^0)) + \frac{\eta(\mathcal{N})}{2}$$

We have also

$$\lim_{n \to \infty} \log\det(\Gamma_n(W_n)) - \log\det(\Gamma(W_n)) \stackrel{a.s.}{=} \lim_{n \to \infty} \Phi(\Gamma_n(W_n)) - \log\det(\Gamma(W_n)) \stackrel{a.s.}{=} 0$$

and finally

$$\lim_{n \to \infty} \log\det(\Gamma(W_n)) - \frac{\eta(\mathcal{N})}{2} < \log\det(\Gamma_n(W_n)) < \log\det(\Gamma(W^0)) + \frac{\eta(\mathcal{N})}{2}$$
■

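Now, we can establish the asymptotic normality of the estimator.

Lemma 2 Let $\Delta U_n(W^0)$ be the gradient vector of $U_n(W)$ at $W^0$, $\Delta U(W^0)$ be
the gradient vector of $U(W) := \log\det(\Gamma(W))$ at $W^0$ and $HU_n(W^0)$ be the
Hessian matrix of $U_n(W)$ at $W^0$.
We define finally

$$B(W_k, W_l) := \frac{\partial F_W(Z)}{\partial W_k} \frac{\partial F_W(Z)^T}{\partial W_l}$$

Then we get

1. $HU_n(W^0) \xrightarrow{a.s.} 2 I_0$

2. $\sqrt{n}\, \Delta U_n(W^0) \xrightarrow{Law} \mathcal{N}(0, 4 I_0)$

where the component $(k, l)$ of the matrix $I_0$ is:

$$\mathrm{tr}\left( \Gamma_0^{-1} E\left( B(W_k^0, W_l^0) \right) \right)$$

Proof. First we write

$$A(W_k) = - \frac{\partial F_W(Z)}{\partial W_k} (Y - F_W(Z))^T$$

To prove the lemma, we remark first that the component $(k, l)$ of the matrix
$4 I_0$ is:

$$E\left( \frac{\partial U(W^0)}{\partial W_k} \frac{\partial U(W^0)}{\partial W_l} \right) = E\left( 2\, \mathrm{tr}\left( \Gamma_0^{-1} A^T(W_k^0) \right) \times 2\, \mathrm{tr}\left( \Gamma_0^{-1} A(W_l^0) \right) \right)$$

and, since the trace of the product is invariant under circular permutation,

$$E\left( \frac{\partial U(W^0)}{\partial W_k} \frac{\partial U(W^0)}{\partial W_l} \right) = 4 E\left( - \frac{\partial F_{W^0}(Z)^T}{\partial W_k} \Gamma_0^{-1} (Y - F_{W^0}(Z))(Y - F_{W^0}(Z))^T \Gamma_0^{-1} \left( - \frac{\partial F_{W^0}(Z)}{\partial W_l} \right) \right)$$
$$= 4 E\left( \frac{\partial F_{W^0}(Z)^T}{\partial W_k} \Gamma_0^{-1} \frac{\partial F_{W^0}(Z)}{\partial W_l} \right)$$
$$= 4\, \mathrm{tr}\left( \Gamma_0^{-1} E\left( \frac{\partial F_{W^0}(Z)}{\partial W_k} \frac{\partial F_{W^0}(Z)^T}{\partial W_l} \right) \right)$$
$$= 4\, \mathrm{tr}\left( \Gamma_0^{-1} E\left( B(W_k^0, W_l^0) \right) \right)$$

Now, for the component $(k, l)$ of the expectation of the Hessian matrix, we
remark that

$$\lim_{n \to \infty} \mathrm{tr}\left( \Gamma_n^{-1}(W^0) A_n(W_k^0) \Gamma_n^{-1}(W^0) A_n(W_k^0) \right) = 0$$

and

$$\lim_{n \to \infty} \mathrm{tr}\left( \Gamma_n^{-1}(W^0) C_n(W_k^0, W_l^0) \right) = 0$$

so

$$\lim_{n \to \infty} H_n(W^0) = \lim_{n \to \infty} 4\, \mathrm{tr}\left( \Gamma_n^{-1}(W^0) A_n(W_k^0) \Gamma_n^{-1}(W^0) A_n(W_k^0) \right) + 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W^0) B_n(W_k^0, W_l^0) \right) + 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W^0) C_n(W_k^0, W_l^0) \right)$$
$$= 2\, \mathrm{tr}\left( \Gamma_0^{-1} E\left( B(W_k^0, W_l^0) \right) \right)$$
■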

Now, from a classical argument of local asymptotic normality (see for
example Yao [13]), we deduce the following property for the estimator $W_n$:

Proposition 1 We have

$$\lim_{n \to \infty} \sqrt{n} (W_n - W^0) = \mathcal{N}(0, I_0^{-1})$$

However, if $W_n^*$ is the generalized least squares estimator:

$$W_n^* := \arg\min_W \frac{1}{n} \sum_{t=1}^{n} (Y_t - F_W(Z_t))^T \, \Gamma_0^{-1} \, (Y_t - F_W(Z_t))$$

then we also have

$$\lim_{n \to \infty} \sqrt{n} (W_n^* - W^0) = \mathcal{N}(0, I_0^{-1})$$

so $W_n$ has the same asymptotic behavior as the generalized least squares
estimator using the true inverse covariance matrix $\Gamma_0^{-1}$, which is asymptotically
optimal (see for example Ljung [7]). Therefore, the proposed estimator is
asymptotically optimal too.

3.2 Asymptotic distribution of the test statistic $T_n$

Let us assume that the null hypothesis $H_0$ is true. We write
$W_n = \arg\min_{W \in \Theta_s} U_n(W)$ and $W_n^0 = \arg\min_{W \in \Theta_q} U_n(W)$, where $\Theta_q$ is viewed
as a subset of $\mathbb{R}^s$. The asymptotic distribution of $T_n$ is then a consequence of
the previous section. Namely, if we replace $n U_n(W)$ by its Taylor expansion
around $W_n$ and $W_n^0$, following van der Vaart [11], chapter 16, we have:

$$T_n = \sqrt{n} (W_n - W_n^0)^T \, I_0 \, \sqrt{n} (W_n - W_n^0) + o_P(1) \xrightarrow{\mathcal{D}} \chi^2_{s-q}$$

4 Experimental results 
4.1 Simulated example 

Although the estimator associated with the cost function $U_n(W)$ is
theoretically better than the ordinary least squares estimator, it is of some
interest to quantify this fact by simulation. Moreover, there are some pitfalls in
practical situations with MLPs.

The first point is that we have no guarantee of reaching the global minimum
of the cost function; we can only hope to find a good local minimum by running
many estimations with different initial weights.

The second point is that MLPs are black boxes: it is difficult to interpret
their parameters, and it is almost impossible to compare MLPs by comparing
their parameters, even if we try to take into account the possible permutations
of the weights, because the difference between the weights may reflect only the
differences between the local minima reached during learning.

All these reasons explain why we choose, for simplicity, to compare the
estimated covariance matrices of the noise instead of directly comparing the
estimated parameters of the MLPs.

4.1.1 The model 

To simulate our data, we use an MLP with 2 inputs, 3 hidden units, and 2
outputs. We choose to simulate an autoregressive time series, where the outputs
at time $t$ are the inputs for time $t + 1$. Moreover, with MLPs, the statistical
properties of such a model are the same as with independent identically
distributed (i.i.d.) data, because the time series constitutes a mixing process (see
Yao [13]).
The equation of the model is the following

$$Y_{t+1} = F_{W^0}(Y_t) + \varepsilon_{t+1}$$

where

• $Y_0 = (0, 0)$.

• $(Y_t)_{1 \le t \le 1000}$, $Y_t \in \mathbb{R}^2$, is the bidimensional simulated random process.

• $F_{W^0}$ is an MLP function with weights $W^0$ chosen randomly between -2
and 2.

• $(\varepsilon_t)$ is an i.i.d. centered noise with covariance matrix $\Gamma_0$.

In order to study empirically the statistical properties of our estimator, we run
400 independent simulations of the bidimensional time series of length 1000.
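
A minimal sketch of one such simulation is given below. Several choices are assumptions made only for illustration: the weight layout of the MLP, Gaussian noise, and in particular the value of `gamma0`, which is a placeholder and not the matrix used in the experiments.

```python
import numpy as np

def mlp(weights, z):
    """One-hidden-layer tanh MLP with 2 inputs, 3 hidden units, 2 outputs.

    `weights` is a dict of arrays; this layout is an assumption.
    """
    hidden = np.tanh(weights["w"] @ z + weights["w0"])
    return weights["a"] @ hidden + weights["a0"]

def simulate(weights, gamma0, length, rng):
    """Simulate Y_{t+1} = F_{W0}(Y_t) + eps_{t+1}, starting from Y_0 = (0, 0)."""
    chol = np.linalg.cholesky(gamma0)        # chol @ N(0, I) has covariance gamma0
    series = np.zeros((length + 1, 2))
    for t in range(length):
        noise = chol @ rng.standard_normal(2)  # Gaussian noise: an assumption
        series[t + 1] = mlp(weights, series[t]) + noise
    return series

rng = np.random.default_rng(0)
true_weights = {
    "w": rng.uniform(-2, 2, size=(3, 2)), "w0": rng.uniform(-2, 2, size=3),
    "a": rng.uniform(-2, 2, size=(2, 3)), "a0": rng.uniform(-2, 2, size=2),
}
gamma0 = np.array([[1.0, 0.5], [0.5, 1.0]])  # placeholder value, not from the text
series = simulate(true_weights, gamma0, 1000, rng)
print(series.shape)
```

Repeating the call to `simulate` with fresh random draws would give the independent replications described above.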

4.1.2 The results 

Our goal is to compare the estimator minimizing $U_n(W)$, equation (6), and the
weights minimizing the mean square error (MSE), equation (1). For each time
series we estimate the weights of the MLP using the cost function $U_n(W)$ and
the MSE. The estimations were done using the quasi-Newton (second-order)
BFGS algorithm, and for each estimation we keep the best result obtained after
100 random initializations of the weights. Thus, we avoid plaguing our learning
with poor local minima.

We show here the mean of the estimated covariance matrices of the noise for
the $U_n(W)$ and mean square error (MSE) cost functions:

$$U_n(W) : \begin{pmatrix} 4.960 & 3.969 \\ 3.969 & 4.962 \end{pmatrix} \quad \text{and} \quad \text{MSE} : \begin{pmatrix} 4.938 & 3.932 \\ 3.932 & 4.941 \end{pmatrix}$$

The estimated standard deviations of the terms of the matrices are all equal
to 0.01, so the differences observed between the two matrices are statistically
significant. We can see that the estimated covariance of the noise is on average
better with the estimator associated with the cost function $U_n(W)$; in particular,
it seems that there is slightly less overfitting with this estimator, and the
off-diagonal terms are greater than with the least squares estimator. As expected,
the determinant of the mean matrix associated with $U_n(W)$ is 8.86 instead of
8.93 for the matrix associated with the MSE.
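
The determinants can be recomputed from the averaged matrices reported above. Because the printed entries are rounded to three decimals, the last digit obtained this way may differ slightly from the quoted values.

```python
import numpy as np

# Averaged covariance matrices as printed in the text (rounded entries).
mean_cov_logdet = np.array([[4.960, 3.969], [3.969, 4.962]])  # U_n(W) cost
mean_cov_mse = np.array([[4.938, 3.932], [3.932, 4.941]])     # MSE cost

det_logdet = float(np.linalg.det(mean_cov_logdet))
det_mse = float(np.linalg.det(mean_cov_mse))
print(round(det_logdet, 3), round(det_mse, 3))
```

The determinant for $U_n(W)$ comes out smaller than the one for the MSE, in line with the ordering stated in the text.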

4.2 Application to real time series: Pollution of ozone 

Ozone is a reactive oxidant, which is formed both in the stratosphere and
troposphere. Near the ground's surface, ozone is directly harmful to human
health and plant life, and damages physical materials. The population, especially
in large cities and in suburban zones which suffer from summer smog, wants to be
warned of high pollutant concentrations in advance. Statistical ozone modelling,
and more particularly regression models, have been widely studied pQ, [6].
Generally, linear models do not seem to capture all the complexity of the
phenomena. Thus, the use of nonlinear techniques is recommended to deal with
ozone prediction. Here we want to predict ozone pollution at two sites at the same
time. The sites are the south of Paris (13th district) and the top of the Eiffel
Tower. As these sites are very near each other, we can expect the two components
of the noise to be strongly correlated.

4.2.1 The model 

The neural model used in this study is autoregressive and includes exogenous
variables (a so-called NARX model, where X stands for exogenous variables).
Our aim is to predict the maximum level of ozone pollution of the next day
knowing today's maximum level of pollution and the maximal temperature
of the next day. If we denote by $Y^1$ the level of pollution for Paris 13, $Y^2$ the
level of pollution for the Eiffel Tower and $Temp$ the temperature, the model can
be written as follows:

$$(Y^1_{t+1}, Y^2_{t+1}) = F_W(Y^1_t, Y^2_t, Temp_{t+1}) + \varepsilon_{t+1} \qquad (13)$$

We will assume that the variables are mixing as previously. As usual with real
time series, overtraining is a crucial problem. MLPs are very overparametrized
models. Overtraining occurs when the model learns the details of the noise of
the training data. Overtrained models have very poor performance on fresh data.
To avoid overtraining we use in this study the SSM pruning technique, a
statistical stepwise method using a BIC-like criterion (Cottrell et al. [2]). The
MLP with the minimal dimension is found by eliminating the irrelevant weights.
Here, we will compare the behavior of this method for both cost functions: the
mean square error (MSE) and the logarithm of the determinant of the empirical
covariance matrix of the noise ($U_n(W)$).
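
The input/target alignment implied by equation (13) can be sketched as follows; `narx_design` is a hypothetical helper and the numbers in the usage line are toy values, not ozone data.

```python
import numpy as np

def narx_design(y1, y2, temp):
    """Build inputs (Y1_t, Y2_t, Temp_{t+1}) and targets (Y1_{t+1}, Y2_{t+1})."""
    y1, y2, temp = (np.asarray(v, dtype=float) for v in (y1, y2, temp))
    inputs = np.column_stack([y1[:-1], y2[:-1], temp[1:]])
    targets = np.column_stack([y1[1:], y2[1:]])
    return inputs, targets

# Toy series of length 4 give 3 usable (input, target) pairs.
inputs, targets = narx_design([1, 2, 3, 4], [10, 20, 30, 40], [5, 6, 7, 8])
print(inputs[0], targets[0])
```

Note that the temperature column is shifted by one day relative to the pollution columns, since the model uses the next day's maximal temperature.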

4.2.2 The dataset 

This study uses the ozone concentrations from the Air Quality Network of the
Île-de-France Region (AIRPARIF, Paris, France). The data used in this work
are from 1994 to 1997; we use only the months from April to September inclusive
because there is no peak during the winter period. According to the model, we
have the following explanatory variables:

• The maximum temperature of the day

• Persistence, introduced through the previous day's peak ozone.

Before their use in the neural network, all these data have been centered and
normalized. The data used to train the MLPs are chosen randomly in the whole
period and we leave out 100 observations to form a fresh data set (test set), which
will be used for model evaluation. In order to evaluate the models, we repeat
this random sampling 400 times to get 400 covariance matrices on each set for
the two cost functions. Figure 2 is a plot of the centered and normalized original
data.
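
The preprocessing and resampling steps can be sketched as below. The total number of observations (700) is a made-up figure for the example, and the helper names are hypothetical; only the 100-observation test set and the 400 repetitions come from the text.

```python
import numpy as np

def standardize(data):
    """Center each column and scale it to unit standard deviation."""
    data = np.asarray(data, dtype=float)
    return (data - data.mean(axis=0)) / data.std(axis=0)

def random_holdout(n_obs, n_test, rng):
    """Randomly split indices 0..n_obs-1 into a training part and a test part."""
    perm = rng.permutation(n_obs)
    return np.sort(perm[n_test:]), np.sort(perm[:n_test])

rng = np.random.default_rng(0)
n_obs = 700                                   # hypothetical sample size
splits = [random_holdout(n_obs, 100, rng) for _ in range(400)]
train_idx, test_idx = splits[0]
print(len(train_idx), len(test_idx))
```

Each of the 400 splits would then be used to fit the MLP with both cost functions and to compute the residual covariance matrix on the training and test parts.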



Figure 2: Ozone time series 
4.2.3 The results

For the learning set, we get the following results for the averaged covariance
matrix (the estimated standard deviation of the coefficients is about 0.0005):

$$U_n(W) : \begin{pmatrix} 0.27 & 0.20 \\ 0.20 & 0.34 \end{pmatrix} \quad \text{and} \quad \text{MSE} : \begin{pmatrix} 0.27 & 0.18 \\ 0.18 & 0.34 \end{pmatrix}$$

For the test set, we get the following results for the averaged covariance matrix
(the estimated standard deviation of the coefficients is about 0.002):

$$U_n(W) : \begin{pmatrix} 0.29 & 0.22 \\ 0.22 & 0.36 \end{pmatrix} \quad \text{and} \quad \text{MSE} : \begin{pmatrix} 0.33 & 0.20 \\ 0.20 & 0.39 \end{pmatrix}$$

The two matrices are almost the same for the learning set; however, the
off-diagonal terms are greater for the $U_n(W)$ cost function. Moreover, looking at
the averaged matrix on the test set, we see that the generalization capabilities
are better for $U_n(W)$ and the differences are statistically significant. Generally,
the best MLP for $U_n(W)$ has fewer weights than the best MLP for the MSE cost
function. Hence, the proposed cost function leads to a somewhat more
parsimonious model, because the pruning technique is very sensitive to the variance
of the estimated parameters. This gain is valuable regarding the generalization
capacity of the model, because the difference is almost null for the learning data
set but is greater on the test data. Figure 3 is a plot of the centered and normalized
original test data and its prediction.

Figure 3: Predicted time series 



Plot of the predictions of ozone test series 
5 Conclusion

In the linear multidimensional regression model the optimal estimator has an
analytic solution (see Magnus and Neudecker [8]), so it does not make sense to
consider minimization of a cost function. However, for the non-linear
multidimensional regression model, the ordinary least squares estimator is
sub-optimal if the covariance matrix of the noise is not the identity matrix. We can
overcome this difficulty by using the cost function $U_n(W) = \log\det(\Gamma_n(W))$. In this
paper, we have provided a proof of the optimality of the estimator associated
with $U_n(W)$. Statistical reasoning tells us that it is always better for neural
network practitioners to use a more efficient estimator, because such estimators
are better on average, even if the difference seems to be small. This estimator is
especially important if practitioners are using pruning techniques. Indeed,
pruning techniques are based on the Wald test or an approximate Wald test, as for
the optimal brain damage or optimal brain surgeon methods (see Cottrell et al. [2]),
and these tests are very sensitive to the variance of the estimated parameters.
Moreover, we have shown that this cost function leads to a simpler $\chi^2$ test to
determine the number of weights if the model is identifiable. These theoretical
results have been confirmed by a simulated example, and we have seen for a real
time series that we can expect a slight improvement, especially in model selection;
this confirms the fact that such techniques are very sensitive to the variance of
the estimated weights.